Description

This R notebook is a bioinformatics pipeline to analyze data obtained by sorting a barcoded transposon library in Ralstonia eutropha (a.k.a. Cupriavidus necator) by density, achieving a selection by PHB content. For background and details regarding the method, see Wetmore at al., mBio, 2015 and Price et al., Nature, 2018). This notebook has been adapted from a pipeline for analysis of chemostat competition experiments by Michael Jahn.

This notebook analyzes data from the second N-starved PHB sortig experiment. The first experiment har large problems with low barcode diversity and enrichment of unmapped barcodes. This lead to less a mapping ratio of less than 10 %, i e few of the reads could be mapped to a known barcode. For this reason the experiment was repeated, and the rerun came out with normal diversity.

Load data

Libraries

Due to the low mapping, the N starvation was redone and resequenced together with some of Michaels experiments.

Get gene annotation and essentiality data on fructose and formate

Summary statistics

First some diagnostics. Overview about the number of reads per sample and their mapping. This experiment has a normal fraction of mapped reads of about 75 %, which gives more than 6M mapped reads in all samples but two.

Barcode diversity

In the previous PHB sorting experiment, the diversity of the reads was low compared to other types of BarSeq experiments from before. This reduced the number of useful reads per gene. Here we compare the cumulative distribution with experiment 1, and with an older MC sample. We see that the barcode diversity is most similar to the MC sample, and much better than in experiment 1.

Reads per barcode and gene

Distribution of number of reads per barcode. Most samples have on average about 7^2=128 reads per barcode, which is good for quantiation.

Similarly to the above, this is an overview about the number of barcodes per gene as a histogram. This distribution is the same for all conditions and replicates. The second plot is the number of reads per gene, averaged as median over all conditions (excluding Fraction 0 where counts were averaged by BarSeq pipeline).

The average of detected barcodes per gene is 6.3, and the median is 5. The average reads per gene is 874, and the median is 195.

Gene fitness analysis

Depletion between fractions

We can plot log2 FC or normalized gene fitness in every fraction. For this type of overview it is best to summarize individual replicates (3x) to the mean or median, per time point and condition. We also add genome annotation to the summary table.

As sort of an internal control, we compare the gene fitness obtained by a complex procedure to the log2 FC of read counts, which is a very simple measure of ‘fitness’. The two variables correlate well. The correlations are a bit better with means, without giving more outliers, and I will use means from now on.

Enrichment and depletion in the top F1 fraction

The confidence of the enriched mutants in the F1 fraction is low. There are no genes with a score over 3, although the fact that there is some correlation between Formate and Fructose samples hints that these are actually enriched here. The low confidence levels make it hard to draw any conclusions here.

Enrichment and depletion in the bottom F3 fraction

What if we look at the F3 fraction, where the main band is? Here, we see a similar pattern with a correlation between the substrates. However, the scores are higher and many more mutants reach significance.

Most depleted genes in F3

Since the scores are more significant in F3, we choose to look closer at those genes. Here we plot those that are at least depleted with a score of -2 in any medium, sorted by their combined depletion scores in both substrates. We also remove genes that had previously been determined to be essential in either substrate, since their signals could just mean that they fail to metabolize the carbon source during the N starvation and thus did not produce PHB. The essentiality threshold was set to a depletion score of -3 after 8 generations. Some of the genes deemed not essential were still highly deleterious (score <-2), and could be essential in practice. The score after 8 generations in a chemostat is plotted in each subpanel.

Protein interactions

To learn more about functional relationships between enriched genes/mutants, we can submit the gene list to the STRING interaction database and retrieve a network of probable interactions. Copied from Micheals BarSeq-pulse.

Submit the most depleted genes to String! We do this twice, once including all genes and once those essential in both substrates.

Let’s look at genes by their interaction groups. First is the large amino acid metabolism interaction grop. These are genes that relate to the synthesis of

  • Phe, Typ, Trp synthesis (aroQ, aroB, pheA, aroE, aroA, aroC, aroL)
  • Gly, Ser, Thr (trpA)
  • Histidine (hisA)

These are mostly essential in minimal media.

## # A tibble: 9 x 4
##   locus_tag gene_name        Process          Pathway                           
##   <chr>     <chr>            <chr>            <chr>                             
## 1 H16_A0792 pheA H16_A0792   Amino acid meta
 Phenylalanine, tyrosine and trypt

## 2 H16_A0795 aroA H16_A0795   Amino acid meta
 Phenylalanine, tyrosine and trypt

## 3 H16_A1317 aroC H16_A1317   Amino acid meta
 Phenylalanine, tyrosine and trypt

## 4 H16_A2612 trpA H16_A2612   Amino acid meta
 Glycine, serine and threonine met

## 5 H16_A3161 aroE H16_A3161   Amino acid meta
 Phenylalanine, tyrosine and trypt

## 6 H16_A3170 aroQ1 aroQ H16_
 Amino acid meta
 Phenylalanine, tyrosine and trypt

## 7 H16_A3411 hisA H16_A3411   Amino acid meta
 Histidine metabolism              
## 8 H16_A3434 aroB H16_A3434   Amino acid meta
 Phenylalanine, tyrosine and trypt

## 9 H16_A3435 aroL aroK H16_A
 Amino acid meta
 Phenylalanine, tyrosine and trypt


There is one group of three genes that are in the central carbon metabolism or in the pentose pathway. Notably, they are depleted in both fructose and formate.

## # A tibble: 3 x 4
##   locus_tag gene_name      Process                 Pathway                     
##   <chr>     <chr>          <chr>                   <chr>                       
## 1 H16_A1374 pdhA H16_A1374 Carbohydrate metabolism Glycolysis / Gluconeogenesis
## 2 H16_B1213 eda H16_B1213  Carbohydrate metabolism Pentose phosphate pathway   
## 3 H16_B2565 pgl H16_B2565  Carbohydrate metabolism Pentose phosphate pathway

Then there are four small groups. None of these are well known as far as I can tell. Tentative functions are listed below. The fourth group is entirely unknown. These are generally only depleted in fructose, not formate.

## # A tibble: 2 x 4
##   locus_tag gene_name      Process            Pathway             
##   <chr>     <chr>          <chr>              <chr>               
## 1 H16_B1498 frcA H16_B1498 Membrane transport ABC transporters    
## 2 H16_B1500 frcB H16_B1500 Cell motility      Bacterial chemotaxis
## # A tibble: 2 x 4
##   locus_tag gene_name       Process           Pathway                       
##   <chr>     <chr>           <chr>             <chr>                         
## 1 H16_A1071 cyoA1 H16_A1071 Energy metabolism Oxidative phosphorylation     
## 2 H16_A2508 H16_A2508       Lipid metabolism  Glycerophospholipid metabolism
## # A tibble: 2 x 4
##   locus_tag gene_name Process            Pathway       
##   <chr>     <chr>     <chr>              <chr>         
## 1 H16_B2037 H16_B2037 Cellular community Quorum sensing
## 2 H16_B2040 H16_B2040 Cellular community Quorum sensing
## # A tibble: 2 x 4
##   locus_tag gene_name Process      Pathway     
##   <chr>     <chr>     <chr>        <chr>       
## 1 H16_A1372 H16_A1372 Hypothetical Hypothetical
## 2 H16_A1373 H16_A1373 Hypothetical Hypothetical

Then there are genes that are outside of the interaction groups. Importantly, this includes the PBH polymerase phaC.

  • phaC PHB polymerase
  • hisP Transporter for histidine?
  • phbH Carbon metabolism?
## # A tibble: 11 x 4
##    locus_tag gene_name         Process                          Pathway         
##    <chr>     <chr>             <chr>                            <chr>           
##  1 H16_A0045 hisP H16_A0045    Signaling and cellular processes Transporters    
##  2 H16_A0325 phbH ptsH H16_A0
 Signaling and cellular processes Transporters    
##  3 H16_A0801 lapB H16_A0801    Unclassified: metabolism         Glycan metaboli

##  4 H16_A1437 phaC phbC H16_A1
 Carbohydrate metabolism          Butanoate metab

##  5 H16_A1737 H16_A1737         Hypothetical                     Hypothetical    
##  6 H16_A2334 H16_A2334         Hypothetical                     Hypothetical    
##  7 H16_B0642 H16_B0642         Xenobiotics biodegradation and 
 Steroid degrada

##  8 H16_B1409 H16_B1409         Hypothetical                     Hypothetical    
##  9 H16_B1553 H16_B1553         Hypothetical                     Hypothetical    
## 10 H16_B1693 H16_B1693         Xenobiotics biodegradation and 
 Benzoate degrad

## 11 H16_B2337 H16_B2337         Hypothetical                     Hypothetical

Most enriched genes

In the opposite corner of the scatter plot, those that are overrepresented in the bottom fraction, we find weaker scores but if we put the threshold at 1.5 we have a few to work with. The only essential gene here was bfd in fructose. Not all genes here show a clear pattern with increasing density further down. The most enriched, B2054, is also enriched in F1 with fructose. Therefore these should be interpreted with caution.

## # A tibble: 14 x 4
##    locus_tag gene_name       Process                       Pathway              
##    <chr>     <chr>           <chr>                         <chr>                
##  1 H16_B2054 H16_B2054       Hypothetical                  Hypothetical         
##  2 H16_B2043 H16_B2043       Hypothetical                  Hypothetical         
##  3 H16_A1009 H16_A1009       Nucleotide metabolism         Purine metabolism    
##  4 H16_B0989 H16_B0989       Hypothetical                  Hypothetical         
##  5 H16_A0371 rplY ctc H16_A
 Genetic information processi
 Ribosome             
##  6 H16_A1446 H16_A1446       Hypothetical                  Hypothetical         
##  7 H16_A0327 bfd H16_A0327   Unclassified: signaling and 
 Others               
##  8 H16_A1886 H16_A1886       Hypothetical                  Hypothetical         
##  9 H16_B0227 H16_B0227       Genetic information processi
 Chromosome and assoc

## 10 H16_A2127 H16_A2127       <NA>                          <NA>                 
## 11 H16_A1389 H16_A1389       Hypothetical                  Hypothetical         
## 12 PHG377    PHG377          Hypothetical                  Hypothetical         
## 13 H16_A0531 rhlE1 rhlE H16
 Folding, sorting and degrada
 RNA degradation      
## 14 H16_B1774 H16_B1774       Hypothetical                  Hypothetical

Submit the most enriched genes to String! The only interaction detected is between the un-named possible transcription factors A1886 and A1389.

Comparsion between Wetmore and Deseq scores

The pipeline allows for two methods to calculate enrichment or depletion scores: Wetmore (2015) or the Deseq2 method. In the folder results_deseq, the scores have been computed by the simpler Deseq2 method.

Data import and processing

The Deseq2 pipeline output is slightly different. The replicates are already averaged, and the per-fraction (or per-timepoint) scores are in the log2FC column.

We want to see the correlation with the Wetmore method. The methods seem to be largely correlated, but less so in F2 and individual datapoints stand out in all fractions. Especially in depleted genes are there many examples where the Deseq2 score is negative but close to zero.

Let’s look at individual genes. There are single datapoints that are different, but there is not no drastic changes. The conclusion is that the results are robust to the methods used (Deseq2 or Wetmore), which is reassuring.